ABSTRACT

The performance of automatic speech recognition (ASR) has
improved tremendously due to the application of deep neural
networks (DNNs). Despite this progress, building a new
ASR system remains a challenging task, requiring various
resources, multiple training stages and significant expertise.
This paper presents our Eesen framework which drastically
simplifies the existing pipeline to build state-of-the-art ASR
systems. Acoustic modeling in Eesen involves learning a
single recurrent neural network (RNN) predicting
context-independent targets (phonemes or characters). To remove
the need for pre-generated frame labels, we adopt the
connectionist temporal classification (CTC) objective function to
infer the alignments between speech and label sequences. A
distinctive feature of Eesen is a generalized decoding approach
based on weighted finite-state transducers (WFSTs), which
enables the efficient incorporation of lexicons and language
models into CTC decoding. Experiments show that, compared
with the standard hybrid DNN systems, Eesen achieves
comparable word error rates (WERs), while at the same time
speeding up decoding significantly.

Index Terms — Recurrent neural network, connectionist 
temporal classification, end-to-end ASR 

1. INTRODUCTION 

Automatic speech recognition (ASR) has traditionally leveraged
the hidden Markov model/Gaussian mixture model
(HMM/GMM) paradigm for acoustic modeling. HMMs act
to normalize the temporal variability, whereas GMMs compute
the emission probabilities of HMM states. In recent
years, the performance of ASR has been improved dramatically
by the introduction of deep neural networks (DNNs)
as acoustic models. In the hybrid HMM/DNN
approach, DNNs are used to classify speech frames into
clustered context-dependent (CD) states (i.e., senones). On a
variety of ASR tasks, DNN models have shown significant
gains over the GMM models. Despite these advances, building
a state-of-the-art ASR system remains a complicated,
expertise-intensive task. First, acoustic modeling typically
requires various resources such as dictionaries and phonetic
questions. Under certain conditions (e.g., in low-resource
languages), these resources may be unavailable, which restricts
or delays the deployment of ASR. Second, in the hybrid
approach, training of DNNs still relies on GMM models to
obtain (initial) frame-level labels. Building GMM models
normally goes through multiple stages (e.g., CI phone, CD
states, etc.), and every stage involves different feature
processing techniques (e.g., LDA, fMLLR, etc.). Third, the
development of ASR systems relies heavily on ASR experts
to determine the optimal configurations of a multitude of
hyper-parameters, for instance, the number of senones and
Gaussians in the GMM models.

Previous work has made various attempts to reduce the
complexity of ASR. In one line of work, researchers propose to
flat-start DNNs and thus get rid of GMM models. However, this
GMM-free approach still requires iterative procedures such as
generating forced alignments and decision trees. Meanwhile,
another line of work has focused on
end-to-end ASR, i.e., modeling the mapping between speech
and labels (words, phonemes, etc.) directly without any
intermediate components (e.g., GMMs). In this direction, Graves
et al. [13] introduce the connectionist temporal classification
(CTC) objective function to infer speech-label alignments
automatically. This CTC technique is further investigated in
subsequent work on large-scale acoustic modeling tasks. Although
showing promising results, research on end-to-end ASR faces
two major obstacles. First, it is challenging to incorporate
lexicons and language models into decoding. When decoding
CTC-trained models, past work has successfully
constrained search paths with lexicons. However, how
to integrate word-level language models efficiently is still an
unanswered question. Second, the community lacks a
shared experimental platform for the purpose of benchmarking.
End-to-end systems described in the literature differ not
only in their model architectures but also in their decoding
methods. For example, [6] and [8] adopt two distinct versions
of beam search for decoding CTC models. These setup
variations hamper rigorous comparisons not only across
end-to-end systems, but also between the end-to-end and existing
hybrid approaches.

In this paper, we resolve these issues by presenting and
publicly releasing our Eesen framework. Acoustic modeling
in Eesen is viewed as a sequence-to-sequence learning problem.
We exploit deep recurrent neural networks (RNNs)
as the acoustic models, and the Long Short-Term Memory
(LSTM) units as the RNN building blocks.
Using the CTC objective function, Eesen simplifies acoustic
modeling into learning a single RNN over pairs of speech
and context-independent (CI) label sequences. A distinctive
feature of Eesen is a generalized decoding method based on
weighted finite-state transducers (WFSTs). In this method,
individual components (CTC labels, lexicons and language
models) are encoded into WFSTs, and then composed into a
comprehensive search graph. The WFST representation
provides a convenient way of handling the CTC blank label and
enabling beam search during decoding. Our experiments with
the Wall Street Journal (WSJ) benchmark show that Eesen
outperforms the existing end-to-end ASR pipelines. The WERs
of Eesen are on a par with strong hybrid HMM/DNN baselines.
Moreover, the application of CI modeling targets allows Eesen
to speed up decoding and reduce decoding memory usage.
Eesen is released as an open-source project (footnote 1) and
will undergo continuous expansion and optimization.

2. THE EESEN FRAMEWORK: MODEL TRAINING 

Acoustic models in Eesen are deep bidirectional RNNs
trained with the CTC objective function [13]. We describe the
model structure in Section 2.1 and restate key points of CTC
training in Section 2.2. Section 2.3 presents some practical
considerations emerging from our GPU implementation.

2.1. Deep Bidirectional Recurrent Neural Networks 

Compared to the standard feedforward networks, RNNs have
the advantage of learning complex temporal dynamics on
sequences. Given an input sequence $X = (x_1, \ldots, x_T)$, a
recurrent layer computes the forward sequence of hidden states
$\overrightarrow{H} = (\overrightarrow{h}_1, \ldots, \overrightarrow{h}_T)$
by iterating from $t = 1$ to $T$:

$$\overrightarrow{h}_t = \sigma(\overrightarrow{W}_{hx} x_t + \overrightarrow{W}_{hh} \overrightarrow{h}_{t-1} + \overrightarrow{b}_h) \tag{1}$$

where $\overrightarrow{W}_{hx}$ is the input-to-hidden weight matrix and
$\overrightarrow{W}_{hh}$ is the hidden-to-hidden weight matrix. In
addition to the inputs $x_t$, the hidden activations
$\overrightarrow{h}_{t-1}$ from the previous time step are fed
to influence the hidden outputs at the current time step. In a
bidirectional RNN, an additional recurrent layer computes the
backward sequence of hidden outputs $\overleftarrow{H}$ from
$t = T$ to 1:

$$\overleftarrow{h}_t = \sigma(\overleftarrow{W}_{hx} x_t + \overleftarrow{W}_{hh} \overleftarrow{h}_{t+1} + \overleftarrow{b}_h) \tag{2}$$

Our acoustic model is a deep architecture, in which we stack
multiple bidirectional recurrent layers. At each frame $t$, the
concatenation of the forward and backward hidden outputs
$[\overrightarrow{h}_t, \overleftarrow{h}_t]$ from the current layer
is treated as the input to the next recurrent layer.
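
As a minimal sketch of Eqs. (1) and (2), the following pure-Python code runs one bidirectional layer over a short sequence. It is illustrative only and is not the Eesen implementation: the function names, the list-based matrices and the zero initial hidden state are assumptions made for the example.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, x):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def recurrent_pass(xs, W_hx, W_hh, b, reverse=False):
    """Run one recurrent layer over xs and return hidden states in time order.

    With reverse=True the recursion runs from t = T down to 1, as in Eq. (2).
    The initial hidden state is assumed to be all zeros.
    """
    h = [0.0] * len(b)
    out = [None] * len(xs)
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    for t in order:
        pre = [u + v + c for u, v, c in zip(matvec(W_hx, xs[t]), matvec(W_hh, h), b)]
        h = [sigmoid(v) for v in pre]
        out[t] = h
    return out


def bidirectional_layer(xs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden outputs at every frame."""
    fwd = recurrent_pass(xs, *fwd_params)
    bwd = recurrent_pass(xs, *bwd_params, reverse=True)
    return [f + b for f, b in zip(fwd, bwd)]
```

Stacking layers then amounts to feeding the output of `bidirectional_layer` into another call whose input-to-hidden matrices have twice as many columns.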

1 https://github.com/yajiemiao/eesen 



Fig. 1. A memory block of LSTM. 


Learning of RNNs can be done using back-propagation
through time (BPTT). In practice, training RNNs to learn
long-term temporal dependency can be difficult due to the
vanishing gradients problem. To overcome this issue, we
apply the LSTM units as the building blocks of RNNs.
LSTM contains memory cells with self-connections to store
the temporal states of the network. Also, multiplicative gates
are added to control the flow of information. Fig. 1 depicts
the structure of the LSTM units we use. The blue curves
represent peephole connections that link the memory
cells to the gates to learn precise timing of the outputs. The
computation at the time step $t$ can be formally written as
follows. We omit the $\rightarrow$ arrow for uncluttered formulation.

$$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + W_{ic} c_{t-1} + b_i) \tag{3a}$$

$$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + W_{fc} c_{t-1} + b_f) \tag{3b}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \phi(W_{cx} x_t + W_{ch} h_{t-1} + b_c) \tag{3c}$$

$$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + W_{oc} c_t + b_o) \tag{3d}$$

$$h_t = o_t \odot \phi(c_t) \tag{3e}$$

where $i_t$, $o_t$, $f_t$, $c_t$ are the activations of the input gates,
output gates, forget gates and memory cells respectively. The
$W_{\cdot x}$ weight matrices connect the inputs with the units, whereas
the $W_{\cdot h}$ matrices connect the previous hidden states with the
units. The $W_{\cdot c}$ terms are diagonal weight matrices for peephole
connections. Also, $\sigma$ is the logistic sigmoid nonlinearity,
and $\phi$ is the hyperbolic tangent nonlinearity. The computation
of the backward LSTM layer can be represented similarly.
In this work, we use a purely LSTM-based architecture
as the acoustic model. However, combining LSTMs with other
network structures, e.g., time-delay [23, 24] or convolutional
neural networks [25, 19], is straightforward to achieve.
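
The gate equations (3a) to (3e) can be traced in a few lines of code. The sketch below computes a single forward LSTM step with peephole connections in pure Python. It is a hedged illustration rather than the Eesen GPU code: the parameter dictionary `p`, its key names and the list-based matrices are assumptions, and the diagonal peephole matrices are stored as plain vectors.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W_x, x, W_h, h, b):
    """Compute W_x x + W_h h + b for list-of-rows matrices."""
    return [
        sum(w * v for w, v in zip(row_x, x))
        + sum(w * v for w, v in zip(row_h, h))
        + bias
        for row_x, row_h, bias in zip(W_x, W_h, b)
    ]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step; returns the new hidden output and cell state."""
    a_i = affine(p["W_ix"], x, p["W_ih"], h_prev, p["b_i"])
    a_f = affine(p["W_fx"], x, p["W_fh"], h_prev, p["b_f"])
    a_c = affine(p["W_cx"], x, p["W_ch"], h_prev, p["b_c"])
    a_o = affine(p["W_ox"], x, p["W_oh"], h_prev, p["b_o"])
    # (3a), (3b): input and forget gates peek at the previous cell state.
    i = [sigmoid(a + w * c) for a, w, c in zip(a_i, p["w_ic"], c_prev)]
    f = [sigmoid(a + w * c) for a, w, c in zip(a_f, p["w_fc"], c_prev)]
    # (3c): new cell state.
    c = [fk * ck + ik * math.tanh(ak) for fk, ck, ik, ak in zip(f, c_prev, i, a_c)]
    # (3d): the output gate peeks at the current cell state.
    o = [sigmoid(a + w * ck) for a, w, ck in zip(a_o, p["w_oc"], c)]
    # (3e): hidden output.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```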

2.2. Training with Connectionist Temporal Classification 

Unlike in the hybrid approach, the RNN model in our Eesen
framework is not trained using frame-level labels with respect
to the cross-entropy (CE) criterion. Instead, following
previous work, we adopt the CTC objective [13] to automatically
learn the alignments between speech frames and their label
sequences (e.g., phonemes or characters). Assume that the
label sequences in the training data contain $K$ unique labels.
Normally $K$ is a relatively small number, e.g., around
45 for English when the labels are phonemes. An additional
blank label $\varnothing$, which means no labels being emitted,
is added to the labels. For simplicity of formulation, we
denote every label using its index in the label set. Given an
utterance $X = (x_1, \ldots, x_T)$, its label sequence is denoted as
$z = (z_1, \ldots, z_U)$. In our implementation, we always index
the blank as 0. Therefore $z_u$ is an integer ranging from 1 to
$K$. The length of $z$ is constrained to be no greater than the
length of the utterance, i.e., $U \le T$. CTC aims to maximize
$\ln \Pr(z|X)$, the log-likelihood of the label sequence given the
inputs, by optimizing the RNN model parameters.

The final layer of the RNN is a softmax layer which has
$K+1$ nodes that correspond to the $K+1$ labels (including $\varnothing$).
At each frame $t$, we get the output vector $y_t$ whose $k$-th
element $y_t^k$ is the posterior probability of the label $k$. However,
since the labels $z$ are not aligned to the frames, it is difficult to
evaluate the likelihood of $z$ given the RNN outputs. To bridge
the RNN outputs with label sequences, an intermediate
representation, the CTC path, is introduced in [13]. A CTC path
$p = (p_1, \ldots, p_T)$ is a sequence of labels at the frame level.
It differs from $z$ in that the CTC path allows occurrences of
the blank label and repetitions of non-blank labels. The total
probability of the CTC path is decomposed into the probability
of the label $p_t$ at each frame:

$$\Pr(p|X) = \prod_{t=1}^{T} y_t^{p_t} \tag{4}$$


The label sequence $z$ can then be mapped to its corresponding
CTC paths. This is a one-to-multiple mapping because multiple
CTC paths can correspond to the same label sequence.
For example, both “A A ∅ ∅ B C ∅” and “∅ A A B ∅ C C”
are mapped to the label sequence “A B C”. We denote the set
of CTC paths for $z$ as $\Phi(z)$. Then, the likelihood of $z$ can be
evaluated as a sum of the probabilities of its CTC paths:

$$\Pr(z|X) = \sum_{p \in \Phi(z)} \Pr(p|X) \tag{5}$$
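
The relation between CTC paths and label sequences, and the sum in Eq. (5), can be illustrated with a brute-force sketch. The code enumerates every possible path, so it is only usable on tiny inputs. The names `collapse`, `path_prob` and `brute_force_likelihood` are hypothetical, labels are represented by their integer indices with the blank at index 0, and `y[t][k]` stands for the softmax output of label `k` at frame `t`.

```python
from itertools import product

BLANK = 0


def collapse(path):
    """Map a frame-level CTC path to its label sequence.

    Consecutive repetitions are merged first, then blanks are dropped.
    """
    out, prev = [], None
    for p in path:
        if p != prev and p != BLANK:
            out.append(p)
        prev = p
    return tuple(out)


def path_prob(path, y):
    """Eq. (4): product of the per-frame label posteriors along the path."""
    prob = 1.0
    for t, p in enumerate(path):
        prob *= y[t][p]
    return prob


def brute_force_likelihood(z, y):
    """Eq. (5): sum the probabilities of all CTC paths that collapse to z."""
    n_labels = len(y[0])
    total = 0.0
    for path in product(range(n_labels), repeat=len(y)):
        if collapse(path) == tuple(z):
            total += path_prob(path, y)
    return total
```

The number of enumerated paths is the label count raised to the power of the number of frames, which is why a more compact representation is needed for real utterances.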

However, summing over all the CTC paths is computationally
intractable. A solution is to represent the possible CTC paths
compactly as a trellis. To allow blanks in CTC paths, we add
“0” (the index of $\varnothing$) to the beginning and the end of $z$, and
also insert “0” between every pair of the original labels in $z$.
The resulting augmented label sequence $l = (l_1, \ldots, l_{2U+1})$ is
leveraged in a forward-backward algorithm for efficient
likelihood evaluation. Specifically, in a forward pass, the variable
$\alpha_t^u$ represents the total probability of all CTC paths that end
with label $l_u$ at frame $t$. As with the case of HMMs [26],
$\alpha_t^u$ can be recursively computed from $\alpha_{t-1}^{u}$ and
$\alpha_{t-1}^{u-1}$. Similarly, a backward variable $\beta_t^u$ carries
the total probability of all CTC paths that start with label $l_u$
at $t$ and reach the final frame $T$. The likelihood of the label
sequence $z$ can then be computed as:

$$\Pr(z|X) = \sum_{u=1}^{2U+1} \alpha_t^u \beta_t^u \tag{6}$$

where $t$ can be any frame $1 \le t \le T$. The objective
$\ln \Pr(z|X)$ now becomes differentiable with respect to the
RNN outputs $y_t$. We define an operation on the augmented
label sequence $\Upsilon(l, k) = \{u \mid l_u = k\}$ that returns the
positions in $l$ whose label has the value $k$. The derivative of the
objective with respect to $y_t^k$ can be derived as:

$$\frac{\partial \ln \Pr(z|X)}{\partial y_t^k} = \frac{1}{\Pr(z|X)} \frac{1}{y_t^k} \sum_{u \in \Upsilon(l, k)} \alpha_t^u \beta_t^u \tag{7}$$

These errors are back-propagated through the softmax layer
and further into the RNN to update the model parameters.
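
To make the trellis computation concrete, here is a minimal sketch of the forward pass only, in pure Python and without the log-domain arithmetic a practical implementation would need. It follows the standard CTC forward recursion, which in addition to the two predecessors named above also allows a transition that skips over a blank between two different non-blank labels; that extra transition, the function name and the list-based inputs are assumptions of this example, not details taken from the Eesen code.

```python
def ctc_forward_likelihood(z, y, blank=0):
    """Return Pr(z|X) from per-frame posteriors y[t][k] via the forward variables."""
    # Augmented sequence: blank at both ends and between every label pair.
    l = [blank]
    for label in z:
        l += [label, blank]
    T, S = len(y), len(l)

    alpha = [[0.0] * S for _ in range(T)]
    # A path may start with the leading blank or with the first label.
    alpha[0][0] = y[0][l[0]]
    if S > 1:
        alpha[0][1] = y[0][l[1]]

    for t in range(1, T):
        for u in range(S):
            total = alpha[t - 1][u]              # stay on the same position
            if u >= 1:
                total += alpha[t - 1][u - 1]     # advance by one position
            if u >= 2 and l[u] != blank and l[u] != l[u - 2]:
                total += alpha[t - 1][u - 2]     # skip the blank in between
            alpha[t][u] = total * y[t][l[u]]

    # A complete path ends on the final blank or on the last label.
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
```

The cost is proportional to the number of frames times the length of the augmented sequence, instead of growing exponentially with the number of frames.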


2.3. GPU Implementation 

We implement the training of the RNN models on GPU devices.
To fully exploit the capacity of GPUs, multiple utterances
are processed at a time in parallel. This parallel processing
speeds up model training by replacing matrix-vector
multiplication over single frames with matrix-matrix
multiplication over multiple frames. Within a group of parallel
utterances, we pad every utterance to the length of the longest
utterance in the group. These padding frames are excluded
from gradient computation and parameter updating. For further
acceleration, the training utterances are sorted by their
lengths, from the shortest to the longest. The utterances in the
same group then have approximately the same length, which
minimizes the number of padding frames. To ensure training
stability, the gradients of RNN parameters are clipped to the
range of [-50, 50].
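
The effect of length sorting on padding, and the clipping of gradients, can be sketched as follows. This is a toy illustration in pure Python, not the GPU code: the helper names are hypothetical, utterances are represented only by their frame counts, and clipping is shown element-wise on a flat list of gradient values.

```python
def make_groups(lengths, group_size, sort_by_length=True):
    """Split utterance indices into groups that are processed in parallel."""
    order = list(range(len(lengths)))
    if sort_by_length:
        order.sort(key=lambda i: lengths[i])  # shortest to longest
    return [order[i:i + group_size] for i in range(0, len(order), group_size)]


def padding_frames(lengths, groups):
    """Count the frames added when each group is padded to its longest member."""
    total = 0
    for group in groups:
        longest = max(lengths[i] for i in group)
        total += sum(longest - lengths[i] for i in group)
    return total


def clip_gradients(grads, limit=50.0):
    """Clip every gradient value to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]
```

For four utterances of 10, 3, 9 and 4 frames in groups of two, grouping in the given order pads 12 frames, whereas grouping after sorting pads only 2.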

CTC learning is also expensive because the forward and
backward vectors ($\alpha_t$ and $\beta_t$) have to be computed
sequentially, either from $t = 1$ to $T$ or from $t = T$ to 1. As with
RNNs, our implementation of CTC also processes multiple
utterances at the same time. Moreover, at a specific frame $t$,
the elements of $\alpha_t$ (and $\beta_t$) are independent and thus can be
computed in parallel.


3. THE EESEN FRAMEWORK: DECODING 
3.1. Decoding with WFSTs 

Previous work has introduced a variety of methods
to decode CTC-trained models. These methods, however,
either fail to integrate word-level language models or
achieve the integration under constrained conditions (e.g.,
n-best list rescoring in [6]). In this work, we propose a
generalized decoding approach based on WFSTs [27, 28]. A WFST
is a finite-state acceptor (FSA) in which each transition has
an input symbol, an output symbol and a weight. A path
through the WFST takes a sequence of input symbols and
emits a sequence of output symbols. Our decoding method
represents the CTC labels, lexicons and language models as
separate WFSTs. Using highly-optimized FST libraries such
as OpenFST [29], we can fuse the WFSTs efficiently into a
single search graph. Building of the individual WFSTs is
described as follows. Although exemplified in the scenario of
English, the same procedures hold for other languages.

Fig. 2. A toy example of the grammar (language model)
WFST. The arc weights are the probability of emitting the
next word when given the previous word. The node 0 is the
start node, and the double-circled node is the end node.

Fig. 3. The WFST for the phoneme-lexicon entry “is IH Z”.
The “<eps>” symbol means no inputs are consumed or no
outputs are emitted.

Grammar. A grammar WFST encodes the permissible
word sequences in a language/domain. The WFST shown in
Fig. 2 represents a toy language model which permits two
sentences “how are you” and “how is it”. The WFST symbols
are the words, and the arc weights are the language model
probabilities. With this WFST representation, CTC decoding
in principle can leverage any language models that can be
converted into WFSTs. Following conventions in the
literature [28], the language model WFST is denoted as G.

Lexicon. A lexicon WFST encodes the mapping from
sequences of lexicon units to words. Depending on what labels
our RNN has modeled, there are two cases to consider. If
the labels are phonemes, the lexicon is a standard dictionary
as we normally have in the hybrid approach. When the labels
are characters, the lexicon simply contains the spellings
of the words. A key difference between these two cases is
that the spelling lexicon can be easily expanded to include
any out-of-vocabulary (OOV) words. In contrast, expansion
of the phoneme lexicon is not so straightforward. It relies on
some grapheme-to-phoneme rules/models, and is potentially
subject to errors. The lexicon WFST is denoted as L. Figs. 3
and 4 illustrate these two cases of building L.

For the spelling lexicon, there is another complication to
deal with. With characters as CTC labels, we usually insert
an additional space character between every pair of words,
in order to model word delimiting in the original transcripts.
During decoding, we allow the space character to optionally
appear at the beginning and end of a word. This complication
can be handled easily by the WFST shown in Fig. 4.

Fig. 4. The WFST for the spelling of the word “is”. We allow
the word to optionally start and end with the space character
“<space>”.

Token. The third WFST component maps a sequence of
frame-level CTC labels to a single lexicon unit (phoneme or
character). For a lexicon unit, its token WFST is designed to
subsume all of its possible label sequences at the frame level.
Therefore, this WFST allows occurrences of the blank label
∅, as well as repetitions of any non-blank labels. For example,
after processing 5 frames, the RNN model may generate
3 possible label sequences “A A A A A”, “∅ ∅ A A ∅”, “∅ A
A A ∅”. The token WFST maps all these 3 sequences into a
singleton lexicon unit “A”. Fig. 5 shows the WFST structure
for the phoneme “IH”. We denote the token WFST as T.
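
The set of frame-level sequences that a token WFST maps to one lexicon unit is a regular language: optional blanks, one or more repetitions of the unit, optional blanks. The sketch below checks membership with a regular expression as a stand-in for the transducer. It does not build a WFST and does not use any FST library; the function names and the `<blank>` spelling of the blank label are assumptions for the example.

```python
import re


def token_pattern(unit, blank="<blank>"):
    """Regex over space-separated frame labels accepted for a single unit."""
    b, u = re.escape(blank), re.escape(unit)
    # leading blanks, then the unit repeated at least once, then trailing blanks
    return re.compile(rf"^(?:{b} )*{u}(?: {u})*(?: {b})*$")


def maps_to_unit(frame_labels, unit, blank="<blank>"):
    """True if the frame-level label sequence corresponds to exactly one unit."""
    return token_pattern(unit, blank).match(" ".join(frame_labels)) is not None
```

A sequence such as `A <blank> A` is rejected here because the blank between the two runs separates two occurrences of the unit rather than one.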

Search Graph. After compiling the three individual WFSTs,
we compose them into a comprehensive search graph.
The lexicon and grammar WFSTs are composed first. Two
special WFST operations, determinization and minimization,
are performed over their composition, in order to compress
the search space and thus speed up decoding. The resulting
WFST LG is then composed with the token WFST,
which finally generates the search graph. Overall, the order of
the FST operations is:

$$S = T \circ \min(\det(L \circ G)) \tag{8}$$

where $\circ$, det and min denote composition, determinization
and minimization respectively. The search graph S encodes
the mapping from a sequence of CTC labels emitted on
speech frames to a sequence of words.

3.2. Posterior Normalization 

When decoding the hybrid DNN models, we need to scale the
state posteriors from the DNNs using state priors. The priors
are usually estimated from the forced alignments of the
training data. During decoding of the CTC-trained models,
we adopt a similar procedure. Specifically, we run the final
RNN model over the training set for a propagation pass.
Labels with the largest posteriors are picked as the frame-level
alignments, from which priors of the labels are estimated.
However, this method does not perform well in our experiments.
Part of the reason is that the softmax-layer outputs
from a CTC-trained model display a highly peaky distribution.
That is, a majority of the frames have the blank
as their labels. The activation of the non-blank labels only
appears in a narrow region along the time axis. This causes
the prior estimates to be dominated by the count of the blank.

Fig. 5. An example of the token WFST which depicts the
phoneme “IH”. We allow the occurrences of the blank label
“<blank>” and the repetitions of the non-blank label “IH”.

Alternatively, we propose to estimate more robust label
priors from the label sequences in the training data. As
mentioned in Section 2.2, the label sequences actually used by
CTC training are the augmented label sequences, which insert
the blank at the beginning, at the end, and between every
label pair in the original label sequences. We compute the priors
from the augmented label sequences (e.g., “∅ IH ∅ Z ∅”),
instead of the original ones (e.g., “IH Z”), through simple
counting. In our experiments, this simple method gives better
recognition accuracy than both the aforementioned
frame-alignment method and the proposal described in prior work.
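
The counting procedure is simple enough to state in code. The sketch below estimates label priors as relative frequencies over the augmented label sequences. The function names and the `<blank>` symbol are hypothetical, and how the resulting priors are applied to the posteriors during decoding is not shown.

```python
from collections import Counter


def augment(z, blank="<blank>"):
    """Insert the blank at the beginning, at the end and between every label pair."""
    augmented = [blank]
    for label in z:
        augmented += [label, blank]
    return augmented


def label_priors(label_sequences, blank="<blank>"):
    """Relative frequency of every label over all augmented label sequences."""
    counts = Counter()
    for z in label_sequences:
        counts.update(augment(z, blank))
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

For the single training sequence `["IH", "Z"]`, the augmented sequence has five symbols, so the blank receives a prior of 0.6 and each phoneme 0.2.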


4. EXPERIMENTS 
4.1. Experimental Setup 

The experiments are conducted on the Wall Street Journal
(WSJ) corpus that can be obtained from LDC under the
catalog numbers LDC93S6B and LDC94S13B. Data preparation
gives us 81 hours of transcribed speech, from which we select
95% as the training set and the remaining 5% for cross
validation. As discussed in Section 2, we apply deep RNNs as
the acoustic models. Inputs of the RNNs are 40-dimensional
filterbank features together with their first and second-order
derivatives. The features are normalized via mean subtraction
and variance normalization on the speaker basis.
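
Per-speaker mean and variance normalization can be sketched as below. This is an illustrative pure-Python version under stated assumptions: statistics are pooled over all frames of one speaker, the population variance is used, and dimensions with zero variance are left unscaled. The function name is hypothetical and the computation of the derivative features is not shown.

```python
import math


def speaker_cmvn(frames):
    """Normalize one speaker's feature frames to zero mean and unit variance.

    frames is a list of equal-length feature vectors pooled over all
    utterances of the speaker.
    """
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        stds.append(math.sqrt(var) if var > 0 else 1.0)
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]
```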

Initial values of the model parameters are randomly drawn
from a uniform distribution with the range [-0.1, 0.1]. The
model is trained with BPTT, in which the errors are
back-propagated from CTC. Utterances in the training set are sorted
by their lengths, and 10 utterances are processed in parallel
at a time. The error rate of the hypothesized labels is
monitored to determine learning rates. The hypothesized labels
are formed by first picking the frame-level labels (the label
with the largest probability at every frame), and then removing
blanks and label repetitions. The label error rate (LER)
can be obtained in the same manner as WER, i.e., computing
the edit distance between the hypothesized labels and the
reference. We adopt a decaying “newbob” learning rate schedule
based on LERs. Specifically, the learning rate starts from
0.00004 and remains unchanged until the drop of LER on
the validation set between two consecutive epochs falls below
0.5%. Then the learning rate is decayed by a factor of 0.5
at each of the subsequent epochs. The whole learning process
terminates when the LER fails to decrease by 0.1% between
two successive epochs.
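
The monitoring loop described above can be sketched in pure Python: form the hypothesized labels, score them with the edit distance, and drive the learning rate from the validation LER. The function names are hypothetical, and two points are assumptions of this sketch rather than statements from the text: the 0.5% and 0.1% thresholds are read as absolute differences in LER percentage points, and training stops as soon as the drop is below 0.1 regardless of whether decay has started.

```python
def greedy_labels(posteriors, blank=0):
    """Pick the best label per frame, then remove repetitions and blanks."""
    hyp, prev = [], None
    for y in posteriors:
        k = max(range(len(y)), key=lambda j: y[j])
        if k != prev and k != blank:
            hyp.append(k)
        prev = k
    return hyp


def edit_distance(hyp, ref):
    """Levenshtein distance between two label sequences."""
    row = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev_diag, row[0] = row[0], i
        for j, r in enumerate(ref, 1):
            cur = min(row[j] + 1, row[j - 1] + 1, prev_diag + (h != r))
            prev_diag, row[j] = row[j], cur
    return row[-1]


def label_error_rate(hyps, refs):
    """LER in percent: total edits divided by total reference length."""
    edits = sum(edit_distance(h, r) for h, r in zip(hyps, refs))
    return 100.0 * edits / sum(len(r) for r in refs)


def newbob_rates(val_lers, lr=4e-5, decay_below=0.5, stop_below=0.1, factor=0.5):
    """Learning rate used in each epoch, given the validation LER after each epoch."""
    rates, decaying, prev = [], False, None
    for ler in val_lers:
        rates.append(lr)
        drop = None if prev is None else prev - ler
        if drop is not None and drop < stop_below:
            break                      # improvement too small: stop training
        if drop is not None and drop < decay_below:
            decaying = True            # from now on, decay at every epoch
        if decaying:
            lr *= factor
        prev = ler
    return rates
```

With validation LERs of 30.0, 20.0, 15.0, 14.7, 14.4 and 14.35, the rate stays at 0.00004 for four epochs, is halved twice, and training stops after the sixth epoch.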

Our decoding follows the WFST-based approach in Section 3.
After posterior normalization, the acoustic model
scores need to be scaled down. The scaling factor lies between
0.5 and 0.9, and its optimal value is decided empirically.
We apply the WSJ standard pruned trigram language
model in the ARPA format (which we will consistently refer
to as standard). To be consistent with previous work,
we report our results on the eval92 set. Our experimental
setup has been released together with Eesen, which enables
the readers to reproduce our numbers easily.

4.2. Phoneme-based Systems 

We explore the optimal RNN configurations on the
phoneme-based systems. When phonemes are taken as CTC labels,
we employ the CMU dictionary (footnote 2) as the lexicon. Due
to the lack of forced alignments, CTC training cannot handle
multiple pronunciations for the same word. For every word, we
only keep its first pronunciation in the lexicon and remove all
the other alternatives. From the lexicon, we extract 72 labels
including phonemes, noise marks and the blank. Our
best-performing model has 4 bi-directional LSTM layers. At each
layer, both the forward and the backward sub-layers contain
320 memory cells. Model training reaches an LER
(phone error rate in this setting) of 8.8% on the validation set.
On the eval92 testing set, the Eesen end-to-end system
achieves a WER of 7.87%, with both the lexicon and the
language model used in decoding. When only the lexicon is
used, our decoding behaves similarly to the beam search in
[6]. In this case, the WER rises sharply to 26.92%. This
degradation reveals the effectiveness of our decoding
approach in integrating language models.

Table 1 shows a comparison between Eesen and a hybrid
HMM/DNN system. The hybrid system is constructed
by following the standard Kaldi recipe “s5” [28]. Inputs of
the DNN model are 11 neighboring frames of filterbank
features. The DNN has 6 hidden layers and 1024 units at each
layer. This DNN model contains slightly more parameters
(9.2 vs 8.5 million) than the Eesen RNN model. Parameters of
the DNN are initialized with restricted Boltzmann machines
(RBMs) that are pre-trained in a greedy layerwise fashion
[30]. The DNN is fine-tuned to optimize the CE objective
with respect to 3421 senones. For fair evaluations, we decode
the DNN model using the original lexicon, rather than
the expanded lexicon used by the Kaldi recipe. From Table 1,
we observe that the performance of the Eesen system is
still behind the hybrid HMM/DNN system. Our recent
developments of Eesen reveal that CTC-trained models outperform
the existing hybrid systems on large-sized datasets, e.g.,
Switchboard. Interested readers may refer to the Eesen
repository for the updates.
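
The two parameter counts quoted above can be roughly reproduced from the stated layer sizes. The sketch below is a back-of-the-envelope check under explicit assumptions, not an inspection of the released models: the RNN input is taken as 120-dimensional (40 filterbank coefficients plus first and second-order derivatives), each LSTM direction is counted according to Eqs. (3a) to (3e) with diagonal peepholes, and the DNN input is assumed to be 11 frames of 40-dimensional filterbank features without derivatives. The function names are hypothetical.

```python
def lstm_direction_params(n_in, n_cells):
    """Parameters of one LSTM direction with peephole connections."""
    per_gate = n_cells * n_in + n_cells * n_cells + n_cells  # W_x, W_h, bias
    peepholes = 3 * n_cells                                  # diagonal W_ic, W_fc, W_oc
    return 4 * per_gate + peepholes


def eesen_rnn_params(n_in=120, n_cells=320, n_layers=4, n_labels=72):
    """Stacked bidirectional LSTM layers followed by a softmax layer."""
    total, layer_in = 0, n_in
    for _ in range(n_layers):
        total += 2 * lstm_direction_params(layer_in, n_cells)
        layer_in = 2 * n_cells  # forward and backward outputs are concatenated
    return total + (layer_in + 1) * n_labels


def dnn_params(n_in=440, n_hidden=1024, n_layers=6, n_out=3421):
    """Fully connected network with biases."""
    sizes = [n_in] + [n_hidden] * n_layers + [n_out]
    return sum((a + 1) * b for a, b in zip(sizes, sizes[1:]))
```

Under these assumptions the counts come out at 8,563,272 for the RNN and 9,206,109 for the DNN, in line with the reported 8.5 and 9.2 million.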

2 http://www.speech.cs.cmu.edu/cgi-bin/cmudict 




A major advantage of Eesen compared with the hybrid
approach is the decoding speed. The acceleration comes from
the drastic reduction of the number of states, i.e., from
thousands of senones to tens of phonemes/characters. To verify
this, Table 2 compares the decoding speed of the Eesen and
the hybrid HMM/DNN systems under their best decoding
settings. From their real-time factors, we observe that decoding
in Eesen is 3.2x faster than that of HMM/DNN. Also, the
decoding graph (TLG) in Eesen is significantly smaller than
the graph (HCLG) used by HMM/DNN, which saves the disk
space for storing the graphs.
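
The speed-up and the graph size reduction follow directly from the real-time factors and graph sizes reported in Table 2. The short calculation below only restates that arithmetic; the variable names are chosen for the example.

```python
# Values as reported in Table 2.
eesen_rtf, hybrid_rtf = 0.64, 2.06
eesen_graph_mb, hybrid_graph_mb = 263, 480

speedup = hybrid_rtf / eesen_rtf
graph_reduction = 1 - eesen_graph_mb / hybrid_graph_mb

print(f"decoding speed-up: {speedup:.1f}x")
print(f"graph size reduction: {graph_reduction:.0%}")
```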


Table 1. Performance of the phoneme-based Eesen system, 
and its comparison with the hybrid HMM/DNN system built 
with Kaldi. “#Param” means the number of parameters. 


| Model          | LM      | #Param | WER%  |
|----------------|---------|--------|-------|
| Eesen RNN      | lexicon | 8.5M   | 26.92 |
| Eesen RNN      | trigram | 8.5M   | 7.87  |
| Hybrid HMM/DNN | trigram | 9.2M   | 7.14  |


Table 2. Comparisons of decoding speed between the
phoneme-based Eesen system and the hybrid HMM/DNN
system. “RTF” refers to the real-time factor in decoding.
“Graph Size” means the size of the decoding graph in terms
of megabytes.

| Model          | RTF  | Graph Size |
|----------------|------|------------|
| Eesen RNN      | 0.64 | 263        |
| Hybrid HMM/DNN | 2.06 | 480        |


4.3. Character-based Systems 


We apply the same RNN architecture discussed in Section 4.2
to modeling characters. We take the word list from the CMU
dictionary as our vocabulary, ignoring the word pronunciations.
CTC training deals with 59 labels including letters,
digits, punctuation marks, etc. Table 3 shows that with the
standard language model, the character-based system gets a
WER of 9.07%.

CTC experiments in past work [6] have adopted an expanded
vocabulary, and re-trained the language model using
text data released together with the WSJ corpus. For fair
comparison, we follow the identical configuration. OOV words
that occur at least twice in the language model training texts
are added to the vocabulary. A new trigram language model
is built (and then pruned) with the language model training
texts. Under this setup, the WER of the Eesen character-based
system is reduced to 7.34%.
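
The vocabulary expansion rule (add OOV words seen at least twice in the language model training texts) can be sketched as follows. Whitespace tokenization and the function name are assumptions of the example. Because the lexicon is a spelling lexicon in this setting, each added word only needs its character sequence.

```python
from collections import Counter


def expand_vocabulary(vocab, lm_sentences, min_count=2):
    """Add words that are outside vocab but occur at least min_count times."""
    counts = Counter(word for sentence in lm_sentences for word in sentence.split())
    oov = {w for w, c in counts.items() if w not in vocab and c >= min_count}
    return set(vocab) | oov
```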

Table 3 lists the results of end-to-end ASR systems that
have been reported in previous work on the same
dataset. Our Eesen framework outperforms both [6] and [8]
in terms of WERs on the testing set. It is worth pointing out
that the 8.7% WER reported in [6] is not obtained in a purely
end-to-end manner. Instead, the authors of [6] generate an
n-best list of hypotheses from a hybrid DNN model, and apply
the CTC model to rescore the hypothesis candidates. Our
Eesen numbers, in contrast, come from a completely
end-to-end pipeline, without any intervention from GMM or hybrid
DNN models.


Table 3. Performance of the character-based Eesen system
using different vocabularies and language models, and
its comparison with results presented in previous work.

| System            | Vocabulary | Language Model | WER%  |
|-------------------|------------|----------------|-------|
| Eesen             | Original   | Standard       | 9.07  |
| Eesen             | Expanded   | Re-trained     | 7.34  |
| Graves et al. [6] | Expanded   | Re-trained     | 8.7   |
| Hannun et al. [8] | Original   | Unknown        | 14.1  |


5. CONCLUSIONS AND FUTURE WORK 

In this work, we present our Eesen framework to build
end-to-end ASR systems. Eesen exploits deep RNNs as the acoustic
models and CTC as the training objective function. We train
the RNN models in a single step, and thus are able to reduce
the complexity of ASR system development. The WFST-based
decoding enables efficient and effective incorporation
of lexicons and language models. Because it is open source,
Eesen can serve as a shared benchmark platform for
research on end-to-end ASR.

In our future work, we plan to further improve the WERs
of Eesen systems via more advanced learning techniques
(e.g., expected transcription loss in [6]) and alternative
decoding approaches (e.g., dynamic decoders). Also, we are
interested in applying Eesen to various languages [32]
and different types (e.g., noisy, far-field) of speech, and
investigating how end-to-end ASR performs under these conditions.
Moreover, due to the removal of GMMs, acoustic modeling
in Eesen cannot leverage speaker-adapted front-ends. We will
study new speaker adaptation [35, 36] and adaptive training
techniques for the CTC models.